## King Fahd University of Petroleum and Minerals College of Computer Science and Engineering Computer Engineering Department

## ICS 233: COMPUTER ARCHITECTURE & ASSEMBLY LANGUAGE

Term 142 (Spring 2014-2015)
Final Exam
Thursday May 21, 2015

Time: 150 minutes, Total Pages: 12

| Name:_ |                         | ID:                  | Section: |
|--------|-------------------------|----------------------|----------|
| Notes: |                         |                      |          |
| •      | Do not open the exam b  | ook until instructed |          |
| •      | Answer all questions    |                      |          |
| •      | All steps must be shown | 1                    |          |

| • Mobile phones must be switched of | off |
|-------------------------------------|-----|
|-------------------------------------|-----|

• Any assumptions made must be clearly stated

| Question | Max Points | Score |
|----------|------------|-------|
| Q1       | 16         |       |
| Q2       | 24         |       |
| Q3       | 12         |       |
| Q4       | 15         |       |
| Q5       | 16         |       |
| Total    | 83         |       |

Dr. Aiman El-Maleh Dr. Mayez Al-Muhammad (Q1)

(i) You are required to enhance a computer performance for running a given program, and there are two possible improvements that can be made: improve execution time of multiply instructions and improve memory access time. Your program takes 120 seconds to execute. Of this time, 25% is used for multiplication instructions, 35% for memory access instructions, and 40% for other tasks. Suppose the execution time of multiplication instructions was improved by a factor of 2, what is the speedup factor required for memory access instructions to have an overall speedup by a factor of 1.5. What will the new execution time be? (5 points)

(ii) Discuss how the compiler could affect the execution time of a high-level program. (3 points)

(iii) Given the following instruction mix of a program on a RISC processor:

(8 points)

| Class  | CPI | Frequency |
|--------|-----|-----------|
| ALU    | 2   | 50%       |
| Branch | 2   | 20%       |
| Jump   | 1   | 10%       |
| Load   | 4   | 12%       |
| Store  | 3   | 8%        |

- (i) What is the average CPI?
- (ii) What is the percent of time used by each instruction class?
- (iii) How much faster would the program run if load time is reduced to 3 cycles, and two ALU instructions could be executed at once, assuming that the cycle time has increased by 4% and the instruction count has increased by 8%?

[24 Points]

(Q2)

(i) Consider the 5-stage pipelined CPU design given below.



- a) Show the conditions that will be used for generating the ForwardA signals.
- b) Show the control signals that will be used for stalling the pipeline due to data and control hazards along with their conditions.
- c) Add the necessary changes to the design, on the given diagram, to allow it to handle data hazards due to load instructions and control hazards.

(12 Points)

(ii) Consider the following MIPS assembly language code: (12 Points)

```
I1: ADDI $s0, $0, 10

I2: ADDI $s1, $0, 5

I3: SLL $s0, $s0, 4

I4: LW $s2, 0($s0)

I5: ADD $s2, $s2, $s1

I6: SW $s2, 4($s0)
```

a) Determine all RAW data hazards in the above program.

| RAW Hazard | Instruction 1 | Instruction 2 |
|------------|---------------|---------------|
| RAW 1      |               |               |
| RAW 2      |               |               |
| RAW 3      |               |               |
| RAW 4      |               |               |
| RAW 5      |               |               |
| RAW 6      |               |               |

b) Complete the following table showing the timing of the above code on the 5-stage pipeline given in part (i) (IF, ID, EX, MEM, WB) that supports **forwarding**. Draw an arrow showing forwarding between the stage that provides the data and the stage that receives the data. Show all stall cycles (draw an X in the box to represent a stall cycle). Determine the number of clock cycles to execute this code.

|          | 1  | 2  | 3  | 4 | 5  | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|----------|----|----|----|---|----|---|---|---|---|----|----|----|----|----|----|
| I1: ADDI | IF | ID | EX | - | WB |   |   |   |   |    |    |    |    |    |    |
| I2: ADDI |    |    |    |   |    |   |   |   |   |    |    |    |    |    |    |
| I3: SLL  |    |    |    |   |    |   |   |   |   |    |    |    |    |    |    |
| I4: LW   |    |    |    |   |    |   |   |   |   |    |    |    |    |    |    |
| I5: ADD  |    |    |    |   |    |   |   |   |   |    |    |    |    |    |    |
| I6: SW   |    |    |    |   |    |   |   |   |   |    |    |    |    |    |    |

- (Q3) A series of three conditional branches B1 to B3 are sequentially executed one after the other. When a branch is encountered, it is executed  $\underline{six}$  times. The outcome of each branch (called actual) is listed in the table below as taken (T) or not taken (N).
  - (i) Assume a 1-bit history predictor is used which is initialized to N. Fill in the prediction entry for each branch and evaluate the probability of correct prediction. (4 Points)

| Conditional Branch                |   |   |   |   |   |   |  |
|-----------------------------------|---|---|---|---|---|---|--|
| B1: Prediction                    |   |   |   |   |   |   |  |
| B1: Actual                        | N | T | N | N | T | T |  |
| B2: Prediction                    |   |   |   |   |   |   |  |
| B2: Actual                        | T | N | T | N | T | N |  |
| B3: Prediction                    |   |   |   |   |   |   |  |
| B3: Actual                        | T | T | N | N | T | T |  |
| Probability of Correct Prediction |   |   |   |   |   |   |  |

(ii) Assume the 2-bit history predictor, given below, is used. Denote by T1 the weakly predict Taken and N1 the weakly predict not taken. Assume that the predictor is initialized to N1 (weak predict not taken). Fill in the prediction entry for each branch and evaluate the probability of correct prediction. (Note: fill your predictions as T, T1, N, N1) (4 Points)



| Conditional Branch                |   |   |   |   |   |   |  |
|-----------------------------------|---|---|---|---|---|---|--|
| B1: Prediction<br>B1: Actual      | N | T | N | N | T | T |  |
| B2: Prediction<br>B2: Actual      | Т | N | T | N | T | N |  |
| B3: Prediction<br>B3: Actual      | Т | T | N | N | T | T |  |
| Probability of Correct Prediction |   |   |   |   |   |   |  |

(iii) Assume that a correctly predicted branch incurs zero stalls and a mis-predicted branch incurs 3 stalls. Evaluate the average number of stalls per instruction for using the 1-bit and 2-bit predictors if 20% of the instructions are branches.

(4 Points)

- (Q4) A CPU has an I-Cache, a D-cache and a main-memory. For some reference program, the instruction distribution is as follows: (1) R-type instructions are 40%, (2) Load instructions are 25%, (3) Store instructions are 10%, and (4) and the rest are for the branch instructions. The CPU clock frequency is 2 GHz. The memory specifications are:
  - 1. I-Cache: the cache hit time is 1 ns and the probability of a hit is 0.96
  - 2. D-Cache: the cache hit time is 2 ns and the probability of a hit is 0.92
  - 3. Main memory: the access time is 45 ns.
  - 4. Assume that CPI= 2 clocks when all memory references are hits.
    - (i) Evaluate the main memory access time in clock cycles. (1 Point)
    - (ii) Evaluate the average number of stalls per instruction and the CPI. (4 Points) (Note: Do not mix adding time in ns with time in clock cycles.)

(iii) Evaluate the average memory access time in ns. (4 Points)

- **(iv)** To improve the access time of this memory system, the designer considered 3 options:
  - **a)** Use a larger I-cache for which the probability of a miss dropped by 50% and all other parameters remain the same.
  - **b)** Use an enhanced D-Cache for which the hit probability is increased by 4% and all other parameters remain the same.
  - c) Use a faster Main memory for which the access time drops by 10% and all other parameters remain the same.

Evaluate the CPI for each of above cases and determine the best option.

(6 Points)

## [16 Points]

- (Q5) A processor has a hierarchical memory system that consists of a 4-ways set-associative cache C and a main-memory M. The processor generates a 32-bit memory address. The cache has a total of 128 sets. The block size is 32 bytes, where each word is four bytes. M is byte organized, i.e. the processor address is relative to a byte but four consecutive bytes are read when loading a word. Answer each of the following questions:
  - (i) Find the number of bits in "Offset", "Index", and "TAG". (3 Points)



(ii) Determine the cache capacity (excluding TAGs and controls) and main memory capacity in bytes. (3 Points)

(iii) Shortly describe how to search for a word with a given 32-bit address in the above memory system by considering the case of a cache hit and a cache miss. You may describe the above using an algorithm, a block-diagram, or flow-chart. (4 Points)

| (iv) | Suppose we use a 2-way set associative cache instead of the above but with the same number of entries. (2 Points)  Is the cache hit time likely to increase or decrease: |
|------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| (v)  | Suppose we use a fully associative cache instead of the above but with the same capacity. (2 Points)  Is the cache hit time likely to increase or decrease:              |

- (vi) Suppose you observe that cache misses significantly increase beyond some data size. Select one of the following options to improve memory access time: (2 Points)
  - (a) Redesign the above cache with same capacity but larger degree of associativity.
  - (b) Select a new cache with same organization but with larger capacity.
  - (c) Redesign the above cache with same capacity but increasing the block size.